Harvard Art Museum Dynamic Image Search and Recommendation¶
Final Project CSCI S-108 Data Mining, Discovery, and Exploration
Bridget Atchison-Nevel
Professor: Stephen Elston
May 2025
Executive summary¶
Problem Statement:¶
The Harvard Art Museums (HAM) Collections Catalog is a digital repository of the roughly 250,000 art objects contained in the collection. The current public access point for searching this extensive collection is a search bar on the HAM website that performs a keyword search, often yielding less-than-optimal results.
The current search is performed by a SQL-style lookup against the database. However, such a lookup frequently fails to return relevant results when the keywords are visual in nature. For instance, the search term ‘blue painting’ returns both non-painting items and paintings that aren’t blue. Another issue is that matching appears to be strict: ‘ancient Chinese vase’ returns no records, while ‘Chinese vase’ returns 132 items, many of which are ancient. Furthermore, any mistyping or spelling error in the search terms yields no records, because no natural language processing (NLP) is applied to the search terms; as a result, ‘photo’ and ‘photograph’ return vastly different numbers of items.
This limited search approach is inadequate for anyone seeking to search by visual elements. My objective is to develop a search and recommendation system that returns relevant items and enables users to ‘like’ their preferred images. The recommender then returns images that closely resemble the ‘liked’ images. The search begins with a user-generated query and continues to be driven by user preferences.
Value of Project¶
The Harvard Museum Collections provide a digital platform for anyone interested in exploring their extensive collection. Users can browse curated collections or use the search bar to find specific items. However, the search function is limited and lacks the ability to prioritize personal preferences. As consumers accustomed to algorithmic searches, we often overlook systems without optimized search or recommendation features.
One challenge is that cultural institutions like museums, libraries, and public archives are not consumer-driven and lack the financial resources to develop these tools. In our increasingly capitalist society, AI resources are heavily focused on corporate enterprises, leaving institutions serving the arts behind.
This project is crucial because it serves individuals seeking art for academic or personal reasons. While there’s no financial incentive, investing in this project is essential.
Results¶
I have created a dynamic system that successfully collects a user query and returns images of the most similar objects in the database. I have provided a Jupyter notebook widget-based UI where the user is able to click on images to 'like' them. Clicking again will unlike the images. At any point, the user is able to query the database for new recommendations based on their liked images. There is a button to clear all likes. The search bar additionally remains throughout the process, and a new query-based search can be performed at any time.
Discussion of Techniques Employed¶
Data Selection¶
I used the 'object' table of the HAM database as well as the 'annotations' table. There are approximately 250,000 object records. I excluded objects that do not have machine-generated descriptions, leaving a final count of 117,243 objects available for my recommender query. Each object record has 79 fields, and I spent a considerable amount of time exploring the ability of the available fields to identify the objects. I began working with a toy version of the data, 24,000 records long, and eventually pulled all 117,243 records into the recommender.
In the final product, each object is represented in two ways: as an image (a CLIP-embedded image vector) and as text (a TF-IDF vector created from the merged machine-generated annotations and record text fields).
Embedding with CLIP¶
I used OpenAI’s CLIP (Contrastive Language–Image Pre-training) to vectorize the images. CLIP is a joint image and text embedding model trained on 400 million image–text pairs in a self-supervised way. It has image-processing capabilities comparable to ResNet-style models trained on ImageNet, but performs better on high-level semantics and is known to be better at processing artwork (Conde and Turgutlu, 2021). I had originally planned to embed the query text with CLIP and search the image space with the text. This approach did not return good results; the problem may have been that a user query is often just a few words, and the approach may work better with longer search queries.
Once embedded with CLIP, the HAM art objects were easy to compare via cosine similarity, and the results were quite stunning. A minimal sketch of the comparison follows.
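This sketch assumes the same ViT-B/32 CLIP checkpoint used later in this notebook; a.jpg and b.jpg are hypothetical local files.
import clip
import torch
from PIL import Image
model, preprocess = clip.load("ViT-B/32", device="cpu")
def clip_vector(path):
    """Return a unit-normalized CLIP embedding for one image file."""
    with torch.no_grad():
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        v = model.encode_image(x)
        return v / v.norm()  # unit length, so a dot product equals cosine similarity
sim = (clip_vector("a.jpg") @ clip_vector("b.jpg").T).item()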
Embedding with TF-IDF using HAM ML Annotations¶
To create a text embedding, I merged the AI-generated descriptions with the object record fields: title, period, classification, division, date, department, artist, culture, and medium. The AI annotations are rich descriptions of the object images but lack historical data like artist and date, so appending these fields creates a more complete representation for text search. I then chose TF-IDF (Term Frequency-Inverse Document Frequency) as the text transformation to use for comparison; TF-IDF has been used successfully in other search engines as well as in text classification. Ultimately, these text vectors are compared using cosine similarity as the similarity metric, as in the small sketch below.
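This toy example is hypothetical (the two-document corpus and the query are invented); the full TF-IDF pipeline appears later in this notebook.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# fit the vectorizer on the corpus of object texts
docs = ["ink painting of plum blossoms", "bronze sculpture of a horse"]
tfidf = TfidfVectorizer(stop_words='english')
matrix = tfidf.fit_transform(docs)
# a query is transformed with the SAME fitted vectorizer, then compared
query_vec = tfidf.transform(["plum blossom painting"])
print(cosine_similarity(query_vec, matrix))  # higher score for the first doc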
Similarity Searches and Vector Store FAISS¶
I indexed each of my embedding sets with Faiss (Facebook AI Similarity Search). This library allows nearest-neighbor search over a large set at amazing speeds. There are options to use L2 norms or cosine distance for similarity, techniques like LSH or Hierarchical Navigable Small World (HNSW) graph exploration, and options to encode vectors at different levels of compression, trading accuracy for speed and size. I explored these options and found that for my dataset size, the standard flat-index search with a cosine similarity metric was the best option.
Flow for Dynamic Search and Recommendation System¶
After much consideration and trial and error, I transformed each art object record into just two vector representations that are used within the recommendation process:
- CLIP-based image vectors (compared via a FAISS flat-index cosine similarity search)
- Textual data (title, artist, date, division, culture, etc.), concatenated with the ML-created annotations and then vectorized with TF-IDF (again compared via a FAISS flat-index cosine similarity search)
The image search and recommendation flow is:
- The user is prompted for a text query.
- The query is used to search the vector database; the top 12 matching images are displayed, each with a similarity score shown for demonstration purposes.
- The user may then choose to like images (with a heart), enter a different search term, get recommendations based on likes, or clear the likes.
Options:
- The user selects liked images by clicking on hearts. Likes are saved to an internal class list; they can also be un-clicked and removed from the list.
- Get recommendations based on likes: the image vectors of the liked images are averaged, and then a search is run in the FAISS vector store for the images most similar to the mean vector. The top 12 that have not already been seen are displayed.
- Clear Likes empties the liked list so that future recommendations are not based on the cleared likes.
- At any point, the user can enter a new text search and continue liking images for more recommendations.
This flow becomes a dynamic search: new search terms can be entered; images can be liked, unliked, or cleared; past displayed images remain available; and new recommendation queries are always possible. A sketch of the mean-vector recommendation step follows.
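Below is a minimal sketch of that step with hypothetical helper names; the actual implementation lives in the HAM_recommend class at the end of this notebook. It assumes embeddings is a NumPy array of CLIP vectors and image_index is a FAISS inner-product index over their normalized versions.
import numpy as np
import faiss
def recommend_from_likes(image_index, embeddings, liked_idx, seen, k=12):
    """Average the liked CLIP vectors, then return the k most similar
    unseen images from a FAISS inner-product (cosine) index."""
    # mean of the liked vectors, re-normalized so inner product == cosine
    mean_vec = embeddings[liked_idx].mean(axis=0, keepdims=True).astype("float32")
    faiss.normalize_L2(mean_vec)
    # over-fetch so that already-seen images can be filtered out
    _, indices = image_index.search(mean_vec, k * 3)
    return [i for i in indices[0] if i not in seen][:k]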
Methods attempted with less than optimal results¶
I tried a number of strategies in processing the data, including:
- Transforming database hex colors into RGB format and then turning the RGB values and percentages into weighted histograms, which I compared using cosine similarity (see the sketch after this list). The results were just okay; using CLIP-embedded images worked far better, so I stuck with that technique.
- Encoding the text fields of each image with CLIP to make a text embedding. In theory, these embeddings share a space with the image embeddings, so I hoped I could embed the query text with CLIP and search the embedded text fields for nearest neighbors. This did not work well at all, likely because neither the database fields nor the queries contained enough text to make robust embeddings.
- Giving an extensive list of art-based terms to a BERT transformer to label art images, with the idea of comparing art by those labels. This was okay, but did not work as well as my other techniques.
- Turning hex colors into color names using a package, intending to compare image colors by these lists of color names. The results here were also not very useful; using the colors given in the ML-created annotations seemed to work better.
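A rough sketch of the weighted-histogram attempt (a hypothetical helper, assuming the 'colors' field format shown in the data preview later in this notebook):
import numpy as np
def color_histogram(colors, bins=8):
    """Build a percent-weighted RGB histogram from a HAM 'colors' list."""
    hist = np.zeros(bins * 3)
    for c in colors:
        r, g, b = (int(c['color'][i:i + 2], 16) for i in (1, 3, 5))  # hex -> RGB
        for j, v in enumerate((r, g, b)):
            hist[j * bins + v * bins // 256] += c['percent']  # weight by coverage
    return hist / (np.linalg.norm(hist) + 1e-9)  # unit length for cosine comparison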
Ideas for future implementations¶
Possible Improvements:
- Order recommendations by popularity rank (or its inverse).
- Cluster the liked images, then average the embeddings within each cluster and use the cluster mean vectors to search for recommendations. This would let users be recommended different categories of images from the same query (see the sketch after this list).
- Query the image embedding index for each liked image and then randomly display results. This would take more compute and time, but with under one million images, I think the difference would be negligible. As I show below, a Faiss flat-index query over 117 thousand images took just 0.005 seconds.
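A possible sketch of the clustering idea (not implemented; hypothetical helper names, assuming unit-normalized CLIP vectors and a FAISS inner-product index like the one built below):
import faiss
from sklearn.cluster import KMeans
def recommend_per_cluster(image_index, liked_vecs, n_clusters=3, k=4):
    """Cluster liked embeddings, then query FAISS with each cluster mean
    so recommendations can cover several distinct 'tastes' at once."""
    n_clusters = min(n_clusters, len(liked_vecs))
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(liked_vecs)
    centers = km.cluster_centers_.astype("float32")
    faiss.normalize_L2(centers)                  # IP index expects unit vectors
    _, indices = image_index.search(centers, k)  # k hits per cluster mean
    return indices                               # one row of picks per cluster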
Limitations:
When merging AI annotations with objects, I chose to append all available annotations to each id. With TF-IDF, this technique poses a problem: some images have more generated annotations than others, so certain key words get a higher term frequency simply because their image has more annotations. This creates a bias in my recommender. To remedy this, I should attach just one annotation to each image instead of several, as sketched below.
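A minimal sketch of that remedy, assuming the desc_df layout used in the data-processing section below: keep only the first description per imageid instead of joining them all.
# keep a single annotation per image so term frequencies stay comparable
one_desc_df = (
    desc_df.dropna(subset=['body'])
           .drop_duplicates(subset='imageid', keep='first')
)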
Conclusion¶
The use of machine learning models to create text captions and descriptions of objects is useful for building quick keyword searches over vast spaces. Using the well-understood and frequently applied TF-IDF to search through AI annotations is a useful blend of old and new technology. Additionally, using a vision model like CLIP to embed visual art images provides a robust means of representing images for fast comparison. While the embedding process is time consuming (about 1,200 images per hour on a Mac M4 chip), once the embeddings are created, similarity searches through a vector store like FAISS are very fast. I would love to work more with these embeddings, using clustering to gain insights into the collections.
Use of Machine Generated Code¶
The DeepSeek model was used to debug code and to generate a portion of the UI for the recommendation system, namely the image cards and the accordion sections in the recommendation class. Validation of this code involved using the recommender: I performed approximately 50 searches post-building and have not seen any bugs or unusual behavior in the UI. The prompts used included: "I have this code (paste code) and I would like to put like buttons with my images that will save the liked indexes to my class variable.", and "I would like to try to find a way to display all history of search but make it collapsible. (paste code)"
Validation of machine-generated annotations:
I visually compared about 30 of the AI-generated annotations. Most were complete and accurate descriptions. There were a few mistakes, for instance, reporting that a sketch of a head in the corner of a drawing was a hand. Because these mistakes were few and minor, I feel comfortable using the annotations with TF-IDF; the length and depth of the descriptions outweigh minor mistakes when using an embedding like TF-IDF.
# import standard libraries
import pandas as pd
import numpy as np
import requests
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
import re
import os
import webcolors
from dotenv import load_dotenv
from io import BytesIO
from joblib import dump, load
import threading
from concurrent.futures import ThreadPoolExecutor
import time
# ML libraries
from transformers import CLIPProcessor, CLIPModel
import clip
import torch
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from tqdm import tqdm
import faiss
from sklearn.model_selection import ParameterGrid
# for displaying Images
from PIL import Image
from IPython.display import display, clear_output
from IPython.display import Image as Image_show
import ipywidgets as widgets
# for processing text
import spacy
from textblob import TextBlob
from gensim.models import KeyedVectors
import gensim.downloader as api
import json
import ast
# set options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
warnings.filterwarnings('ignore')
# Load the .env file
load_dotenv()
api_key = os.environ.get("API_KEY")
Data Extraction and Processing¶
Code to extract and format object records from database¶
The database itself is a relational database called TMS (The Museum System). I have been told by Harvard Museum staff that there is a SQL wrapper that allows SQL-like lookups via an API call: ‘The Harvard Art Museums API is a REST-style service designed for developers who wish to explore and integrate the museums’ collections in their projects. The API provides direct access to the data that powers the museums' website and many other aspects of the museums.’ The format of API calls is as follows:
https://api.harvardartmuseums.org/RESOURCE_TYPE?apikey=YOUR_API_KEY
Data is returned in JSON format, with a limit of 100 records per page. Through the next and prev fields in the JSON info block, I am able to page through the records and load them all into a single dataframe.
def append_records(df, url):
"""
Loads one page of records from Harvard Art database and appends to a dataframe.
Usage:
while url:
df, url = append_records(df, url)
Args:
df: dataframe to append new records to
url: url page to pull new records from
Returns:
df: new dataframe with added records
url: url if next page is available, else None
"""
r = requests.get(url)
# Convert data to JSON format
data = r.json()
# Extract the info
info = data['info']
# concat dataframe
df2 = pd.DataFrame(data['records'])
df = pd.concat([df, df2], axis = 0)
    # advance to the next page if one exists; info.get returns None at the end
    url = info.get('next')
    return df, url
# create initial dataframe
df = pd.DataFrame()
# Extract all fields of all records
url = f'https://api.harvardartmuseums.org/object?fields=*&apikey={api_key}&size=100'
# build dataframe from query
while url:
df, url = append_records(df, url)
df.to_csv('art_objects.csv')
# load to dataframe
df = pd.read_csv('art_objects.csv')
print(f'The size of the dataframe is {df.shape}')
The size of the dataframe is (245924, 79)
Here I keep a subset of the record fields; the subset includes text-based information, an image URL, and a popularity metric¶
keep_list = ['id', 'objectid', 'images', 'division', 'colorcount', 'period',
             'classification', 'technique', 'medium', 'title', 'colors',
             'dated', 'dateend', 'department', 'people',
             'primaryimageurl', 'url', 'rank'
             ]
df = df[keep_list]
df.shape
(245924, 19)
I extract an imageid from the images column so that I can merge in AI-generated annotations, which are indexed by imageid, from a different part of the HAM database.¶
imageid_list = []
for i, row in df.iterrows():
    try:
        # parse the stringified list of image dictionaries
        data = ast.literal_eval(row['images'])
        imageid_list.append(data[0]['imageid'])
    except (ValueError, SyntaxError, TypeError, IndexError, KeyError):
        imageid_list.append(np.nan)
df['imageid'] = imageid_list
df.drop('images', axis=1, inplace=True)
df.head(1)
| id | objectid | division | colorcount | period | classification | technique | medium | title | colors | dated | dateend | department | people | primaryimageurl | url | rank | imageid | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36091 | 36091 | Modern and Contemporary Art | 5 | NaN | Photographs | Negative, gelatin silver (35mm film) | NaN | [New York City skyline with people on ship deck] | [{'color': '#191919', 'spectrum': '#1eb264', 'hue': 'Grey', 'percent': 0.4116666666666667, 'css3': '#000000'}, {'color': '#323232', 'spectrum': '#2eb45d', 'hue': 'Grey', 'percent': 0.35546666666666665, 'css3': '#2f4f4f'}, {'color': '#4b4b4b', 'spectrum': '#3db657', 'hue': 'Grey', 'percent': 0.11113333333333333, 'css3': '#2f4f4f'}, {'color': '#000000', 'spectrum': '#1eb264', 'hue': 'Black', 'percent': 0.09266666666666666, 'css3': '#000000'}, {'color': '#7d7d7d', 'spectrum': '#8362aa', 'hue': 'Grey', 'percent': 0.029066666666666668, 'css3': '#808080'}] | 1936 | 1936 | Busch-Reisinger Museum | [{'role': 'Artist', 'birthplace': 'New York, NY', 'gender': 'male', 'displaydate': '1871 - 1956', 'prefix': None, 'culture': 'American', 'displayname': 'Lyonel Feininger', 'alphasort': 'Feininger, Lyonel', 'name': 'Lyonel Feininger', 'personid': 20002, 'deathplace': 'New York, NY', 'displayorder': 1}] | https://nrs.harvard.edu/urn-3:HUAM:INV207929C_dynmc | https://www.harvardartmuseums.org/collections/object/36091 | 191025 | 379813.0 |
Loading AI annotations and adding to the primary dataframe¶
The AI annotation database is a group of 67,313,736 publicly accessible machine-generated annotations covering 402,949 images. Annotations are generated from 5 computer vision services and 6 large language models. See References for a full list. The types of annotations available are description, face, tag, and text. For my purposes, I pulled about 400,000 description annotations and linked them with the objects that they reference. These descriptions are at the core of my text-based image lookup function. The descriptions themselves come from models: Amazon Nova, Anthropic Claude, Azure OpenAI GPT, Clarifai, and Google Gemini. Annotations date from 2018-02-08 to 2025-05-06.
You can read more about the annotations here [https://ai.harvardartmuseums.org/statistics]
# loading ml annotations extracted from the HAM database
desc_df = pd.read_csv("an_descriptions.csv")
# extract just description and id
desc_df = desc_df[['imageid', 'body']].copy()
# aggregate multiple image descriptions into one record
agg_functions = {'body': lambda x: ' '.join(str(s) for s in x if pd.notna(s))}
# Create new dataframe
df_new = desc_df.groupby('imageid', as_index=False).aggregate(agg_functions)
# add descriptions to main dataframe
df['imageid'] = pd.to_numeric(df['imageid'], errors='coerce').astype('Int64')
df = df.merge(df_new, on='imageid', how = 'left')
df[~df.body.isna()].shape
(117243, 19)
Dropping records with no annotations as well as creating a toy version (20%) of the dataset to work with¶
We have 117,243 records with descriptions to work with. I started with 24,000 images and built the system. Once everything was working, I added roughly 30,000 images at a time until I was using all 117,243.
# dropping images without descriptions
new_df = df[~df['body'].isna()]
print(new_df.shape)
(117243, 19)
# do not rerun this!
#toy_df = new_df.sample(n = 24000)
#toy_df.to_csv("unaltered_toy_df.csv")
# load the 24,000 sampled records
sample1_df = pd.read_csv("unaltered_toy_df.csv")
# get the remainder of unsampled records to process next sample
next_df = new_df[~new_df['id'].isin(sample1_df['id'])]
# sample next set of 30000 images
print(next_df.shape)
sample2_df = next_df.sample(n = 30000)
(93245, 19)
# do not rerun this!
#sample2_df.to_csv("unaltered_toy_df2.csv")
# sample next set of 30000 images
next_df = new_df[~new_df['id'].isin(sample1_df['id'])]
next_df = next_df[~next_df['id'].isin(sample2_df['id'])]
sample3_df = next_df.sample(n = 30000)
# do not rerun this!
# sample3_df.to_csv("unaltered_toy_df3.csv")
sample3_df.shape
(30000, 19)
# grab last 33,245 images
sample1_df = pd.read_csv("unaltered_toy_df.csv")
sample2_df = pd.read_csv("unaltered_toy_df2.csv")
sample3_df = pd.read_csv("unaltered_toy_df3.csv")
exclude_dfs = [sample1_df, sample2_df, sample3_df]
exclude_ids = pd.concat([d['id'] for d in exclude_dfs]).unique()
sample4_df = new_df[~new_df['id'].isin(exclude_ids)]
print(sample4_df.shape)
(33245, 19)
# do not rerun this!
#sample4_df.to_csv("unaltered_toy_df4.csv")
Processing Images¶
I embed the images with OpenAI’s CLIP model and store the vectors in Facebook's FAISS, using a flat-index cosine similarity search to find similar images.¶
- Here I use some optimizations for my Apple M4 chip to speed up the vectorizing process. The bottleneck here is the amount of time that it takes to download each image from the Harvard URL. I used Python’s threadpooling to download multiple URLs at once. This process took about 2 hours for 24,000 images.
- As I develop the project, I am adding additional records and embedding more images. The code is below.
device = (
"mps" if torch.backends.mps.is_available()
else "cpu"
)
# Load CLIP
model, preprocess = clip.load("ViT-B/32", device=device)
if device == "mps":
# Use float16 to improve speed
model = model.half()
def download_image(url, timeout=20):
"""Download image with timeout setting"""
try:
response = requests.get(url, timeout=timeout, stream=True)
response.raise_for_status()
return Image.open(BytesIO(response.content)).convert("RGB")
except Exception as e:
print(f"Failed to download {url}: {str(e)}")
return None
def embed_image(image):
    """Embed a single image with CLIP, returning a 512-dim vector"""
# Return zero vector for missing images
if image is None:
return np.zeros(512, dtype=np.float32)
with torch.no_grad():
image_tensor = preprocess(image).unsqueeze(0).to(device)
if device == "mps":
image_tensor = image_tensor.half()
return model.encode_image(image_tensor).cpu().numpy().flatten()
def process_batch(urls):
"""Process batch of URLs in parallel"""
# use threadpool to process multiple images at once for speed
with ThreadPoolExecutor(max_workers=8) as executor:
images = list(tqdm(executor.map(download_image, urls), total=len(urls)))
return [embed_image(img) for img in images]
# Process first sample of data in batches
batch_size = 64
all_embeddings = []
for i in tqdm(range(0, len(df), batch_size)):
batch_urls = df['primaryimageurl'].iloc[i:i+batch_size].tolist()
batch_embeddings = process_batch(batch_urls)
all_embeddings.extend(batch_embeddings)
df['image_embedding'] = all_embeddings
# do not overwrite this!!
#df.to_csv("toy_image_emb2.csv")
# Process second sample of DataFrame in batches
batch_size = 64
all_embeddings = []
for i in tqdm(range(0, len(sample2_df), batch_size)):
batch_urls = sample2_df['primaryimageurl'].iloc[i:i+batch_size].tolist()
batch_embeddings = process_batch(batch_urls)
all_embeddings.extend(batch_embeddings)
sample2_df['image_embedding'] = all_embeddings
sample2_df.to_csv("toy_image_emb3.csv")
# Process third sample of DataFrame in batches
batch_size = 64
all_embeddings = []
for i in tqdm(range(0, len(sample3_df), batch_size)):
batch_urls = sample3_df['primaryimageurl'].iloc[i:i+batch_size].tolist()
batch_embeddings = process_batch(batch_urls)
all_embeddings.extend(batch_embeddings)
sample3_df['image_embedding'] = all_embeddings
#sample3_df.to_csv("toy_image_emb4.csv")
# Process fourth sample of DataFrame in batches
batch_size = 64
all_embeddings = []
for i in tqdm(range(0, len(sample4_df), batch_size)):
batch_urls = sample4_df['primaryimageurl'].iloc[i:i+batch_size].tolist()
batch_embeddings = process_batch(batch_urls)
all_embeddings.extend(batch_embeddings)
sample4_df['image_embedding'] = all_embeddings
sample4_df.to_csv("toy_image_emb5.csv")
df_1 = pd.read_csv("toy_image_emb2.csv")
df_2 = pd.read_csv("toy_image_emb3.csv")
df_3 = pd.read_csv("toy_image_emb4.csv")
df_4 = pd.read_csv("toy_image_emb5.csv")
df = pd.concat([df_1, df_2, df_3, df_4])
df.shape
(117245, 26)
Creating a Faiss Flat Index from the Image Embeddings¶
- I reformat the embeddings, which were saved to CSV as strings, back into floats so that they can be put into the Faiss index.
- I create a Faiss flat index using the inner-product (IP) option, which computes cosine similarity as long as all vectors are normalized before being added to the index. I save this index to file to be used later by the recommender.
def convert_embedding(x):
    ''' Converts embeddings saved to csv as strings back into lists of floats'''
    if pd.isna(x) or x is None:
        return []
    try:
        # Remove brackets and split into numbers
        nums = re.sub(r'[\[\]]', '', str(x)).split()
        return [float(num) for num in nums]
    except (ValueError, TypeError):
        return []
# Apply conversion
df["image_embedding"] = df["image_embedding"].apply(convert_embedding)
# Convert embeddings to numpy arrays to give to faiss
image_embeddings = np.array(df["image_embedding"].tolist(), dtype="float32")
# check to make sure embeddings are normalized so that IP computes cosine similarity
norms = np.linalg.norm(image_embeddings, axis=1) # Compute L2 norms
print("Min norm:", np.min(norms), "Max norm:", np.max(norms))
# normalize to get best results
faiss.normalize_L2(image_embeddings)
# Create Faiss flat index (exact inner-product search)
dim = image_embeddings.shape[1]
image_index = faiss.IndexFlatIP(dim)
image_index.add(image_embeddings)
# Save Index to be used later in recommender
faiss.write_index(image_index, "image_index4.faiss")
print(image_embeddings.shape)
Min norm: 0.0 Max norm: 13.022693
(117245, 512)
Here I check to see what kind of results the search for similar images returns using the CLIP embedded images and the FAISS index¶
The image that I randomly chose looks like an Asian painting of a branch with calligraphy. The returned images are very similar. This technique works really well!
# Start with the image at location index 64
idx = 64
url = df.iloc[idx]['primaryimageurl']
print("Search Image:")
display(Image_show(url = url, width = 300))
# Query for the top k most similar images
k = 5
image_embedding = np.ascontiguousarray(
    [df.iloc[idx]['image_embedding']],
    dtype="float32"
)
faiss.normalize_L2(image_embedding)
# query the faiss index to find most similar images
distances, indices = image_index.search(image_embedding, k)
print(f"Returned Indices: {indices}\n Returned Images:")
for i in indices[0][1:]:
url = df.iloc[i]['primaryimageurl']
display(Image_show(url = url, width = 300))
Search Image:
Returned Indices: [[ 64 48037 112874 90819 84852]]
 Returned Images:
Another Example¶
Here the randomly chosen image is a sepia image of a person with their hands outstretched, looking to the sky. Again, all of the returned images look very similar to the query.
idx = 18901
url = df.iloc[idx]['primaryimageurl']
print("Search Image:")
display(Image_show(url = url, width = 300))
# Query for the top k most similar images
k = 10
image_embedding = np.ascontiguousarray(
    [df.iloc[idx]['image_embedding']],
    dtype="float32"
)
faiss.normalize_L2(image_embedding)
distances, indices = image_index.search(image_embedding, k)
print(f"Returned Indices: {indices}\n Returned Images:")
for i in indices[0][1:]:
url = df.iloc[i]['primaryimageurl']
display(Image_show(url = url, width = 300))
Search Image:
Returned Indices: [[ 18901 108368 99605 54417 98709 106333 26451 27944 68702 100739]]
 Returned Images:
Let’s look at the speed and accuracy of the different FAISS search techniques.¶
Surprisingly, all of the indexes that I tested produced exactly the same results as the Flat index, which is considered the ground truth for accuracy. They had slightly different speeds, which I plot below.
k = 100
idx = 200
dim = image_embeddings.shape[1]
# normalize query vector
image_embedding = np.ascontiguousarray(
[df.iloc[idx]['image_embedding']],
dtype="float32"
)
faiss.normalize_L2(image_embedding)
# test 4 different FAISS indices for speed and accuracy
results = []
# Flat index will be the baseline
image_index = faiss.IndexFlatIP(dim)
image_index.add(image_embeddings)
start_time = time.time()
distances, ground_truth_flat_indices = image_index.search(image_embedding, k)
end_time = time.time()
elapsed_time = end_time - start_time
print(f"Elasped time for Flat Search is {elapsed_time} seconds.")
results.append(['Flat', elapsed_time, k])
# Hierarchical Navigable Small World
index = faiss.IndexHNSWFlat(dim, 32, faiss.METRIC_INNER_PRODUCT)  # dim, M, metric
index.hnsw.efConstruction = 40
index.add(image_embeddings)
start_time = time.time()
# query the HNSW index (not the flat baseline)
distances, indices = index.search(image_embedding, k)
end_time = time.time()
elapsed_time = end_time - start_time
accuracy = len(set(ground_truth_flat_indices[0]) & set(indices[0]))
results.append(['Hierarchical Navigable Small World',elapsed_time, accuracy])
# Locality-Sensitive Hashing
index = faiss.IndexLSH(dim, 512)
index.add(image_embeddings)
start_time = time.time()
# query the LSH index (not the flat baseline)
distances, indices = index.search(image_embedding, k)
end_time = time.time()
elapsed_time = end_time - start_time
accuracy = len(set(ground_truth_flat_indices[0]) & set(indices[0]))
results.append(['LSH', elapsed_time, accuracy])
# IVFADC (coarse quantizer+PQ on residuals)
quantizer = faiss.IndexFlatL2(dim) # Quantizer must be provided
index = faiss.IndexIVFPQ(quantizer, dim, 100, 32, 8)
index.train(image_embeddings)
index.add(image_embeddings)
start_time = time.time()
# query the IVFPQ index (not the flat baseline)
distances, indices = index.search(image_embedding, k)
end_time = time.time()
elapsed_time = end_time - start_time
accuracy = len(set(ground_truth_flat_indices[0]) & set(indices[0]))
results.append(['IVFADC (coarse quantizer+PQ on residuals)',elapsed_time, accuracy])
Elapsed time for Flat Search is 0.004984855651855469 seconds.
results_df = pd.DataFrame(results, columns = ["Type", "Time", "Accuracy"])
display(results_df)
sns.barplot(data= results_df, y = 'Type', x = 'Time')
plt.title("Time to return top 100 nearest neighbors in Index of Size 117,245.")
plt.xlabel("Time in seconds")
plt.xticks(rotation= 60)
plt.show()
| Type | Time | Accuracy | |
|---|---|---|---|
| 0 | Flat | 0.004985 | 100 |
| 1 | Hierarchical Navigable Small World | 0.003381 | 100 |
| 2 | LSH | 0.003214 | 100 |
| 3 | IVFADC (coarse quantizer+PQ on residuals) | 0.003403 | 100 |
Processing Text Fields¶
Failed attempt¶
Originally, I vectorized these text entries with CLIP and tried to use the resulting vectors in a similarity search against the original text query string. This did not work well at all: there just wasn't enough information in the object text fields, and running the large CLIP model was time-consuming and bulky.
Best approach¶
Instead, I went back to the HAM database and pulled existing AI-generated annotations. I concatenated these with the database text fields, such as artist, title, etc. I then used TF-IDF via scikit-learn’s TfidfVectorizer to compare the images with a text query using cosine similarity. This worked very well. I put the TF-IDF matrix into a Faiss vector store for quick comparison using a flat index.
Below, I am cleaning extra characters from fields, combining two date columns to make one complete date field, and pulling people’s qualities, such as name and culture, from the people dictionary.
# remove 'Department of'
df['department'] = df['department'].str.replace('Department of ', '')
# remove extra characters (escape the dot so 'c.' is matched literally)
df['period'] = df['period'].str.replace(r'c\.|\(|\)', '', regex=True)
# merge two date columns to get one complete date column
df['date'] = np.where(df['dateend'] == 0, df['dated'], df['dateend'])
def extract_from_people(people_entry, field):
""" Extract field information from people field."""
try:
if isinstance(people_entry, str):
# Safely evaluate string to Python object
data = ast.literal_eval(people_entry)
else:
# Already in object form
data = people_entry
# Handle case where data is a list of dictionaries
if isinstance(data, list):
for item in data:
if isinstance(item, dict) and field in item:
return item[field]
# Handle case where data is a single dictionary
elif isinstance(data, dict) and field in data:
return data[field]
return None
except (ValueError, SyntaxError):
return None
# create artist name and culture fields by extracting from the people field
person_fields = ['name', 'culture']
for field in person_fields:
df['artist_' + field] = df['people'].apply(extract_from_people, args = (field,))
# Create one field that will hold all of the text data for tfidf
df["search_text"] = (
df["title"].fillna(" ") + " | " +
df["period"].fillna(" ") + " | " +
df["classification"].fillna(" ") + " | " +
df["division"].fillna(" ") + " | " +
df["date"].astype(str).replace("nan", " ") + " | " +
df["department"].fillna(" ") + " | " +
df["artist_name"].fillna(" ") + " | " +
df["artist_culture"].fillna(" ") + " | " +
df["medium"].fillna(" ")
)
# remove extra separators
df["search_text"] = df["search_text"].str.replace(r"\s*\|\s*\|\s*", " | ", regex=True).str.strip("| ")
display(df[['search_text']][:5])
| search_text | |
|---|---|
| 0 | Friedrich Wilhelm von Schadow in His Studio, Painting "The Ascension of Mary" | Prints | European and American Art | 1845 | Prints | Henry Ritter | German | Lithograph, chine collé, on off-white wove paper |
| 1 | Studies of Women and Lokapurusha | Drawings | Asian and Mediterranean Art | 1855 | Islamic & Later Indian Art | | Gray-black and brown inks and watercolor on beige laid paper |
| 2 | Ceiling Fixture | Drawings | Modern and Contemporary Art | 1994 | Drawings | Catherine Murphy | American | Graphite on off-white wove paper |
| 3 | Branch of Plum Blossoms -- Illustration from the Ten Bamboo Studio Manual of Calligraphy and Painting (Shizhuzhai shuhua pu) | Ming 1368-1644 to Qing 1644-1911 dynasty | Prints | Asian and Mediterranean Art | 1703 | Asian Art | Hu Zhengyan 胡正言 | Chinese | One page from a woodblock printed book mounted as an album leaf; ink and color on paper |
| 4 | Spinner at the Door of the House | Prints | European and American Art | 1689 | Prints | Adriaen van Ostade | Dutch |
# combine formatted database text data with annotations, separated by a space
df['body'] = df['body'] + ' ' + df['search_text']
display(df[['body']][:3])
| body | |
|---|---|
| 0 | The image depicts an artist's studio or workspace. In the center, an older man is seated in a chair, appearing to be an artist working on a painting or drawing. Behind him, there is a partially completed painting or artwork on an easel. In the background, a figure, likely a person, is standing and observing the artist at work. The room contains various artistic supplies, tools, and furnishings typical of a studio setting. This is a detailed pencil or lithograph drawing of what appears to be an artist's studio. The main focus is on an artist sitting in a relaxed position in front of a large canvas or painting. He's wearing a top hat and coat, and appears to be taking a break or contemplating his work. On the canvas, there's a sketch or outline of what looks like a religious figure. The studio space includes a window with panes on the left, some curtains, and various artist's materials and supplies scattered about, including what seems to be a table with tools or materials. In the background, there's another figure standing near a doorway. The drawing has a casual, candid quality that captures a moment in the daily life of an artist at work. The title or signature at the bottom appears to read "M. Sébastien."Friedrich Wilhelm von Schadow in His Studio, Painting "The Ascension of Mary" | Prints | European and American Art | 1845 | Prints | Henry Ritter | German | Lithograph, chine collé, on off-white wove paper |
| 1 | The image appears to be a surreal, abstract drawing or sketch. It contains various elements, including human figures, architectural structures, and abstract shapes and patterns. The style is loose and expressive, with a mixture of pencil, ink, and color markings. The overall composition is complex and layered, creating a dreamlike or subconscious atmosphere. There are no identifiable individuals in the image, only suggestive human forms and faces. This appears to be a historical sketch or drawing with multiple elements. The main figure in the lower left portion is drawn in a dynamic pose. The sketch includes various decorative elements in the upper right, including what appears to be architectural details with arches and ornamental designs. There are also small animal figures, possibly elephants, sketched in the upper portion of the drawing. The artwork has a loose, gestural quality with lines in different colors including red, black, blue and green. The paper appears aged, with a beige or off-white tone. The composition has a somewhat chaotic or energetic feel, with multiple figures and elements arranged across the page in what seems to be a preparatory or study drawing.Studies of Women and Lokapurusha | Drawings | Asian and Mediterranean Art | 1855 | Islamic & Later Indian Art | | Gray-black and brown inks and watercolor on beige laid paper |
| 2 | This black and white image shows a ceiling light fixture with a round, dome-shaped cover. The ceiling has a heavily textured, popcorn-like surface. There appears to be a pull cord or chain hanging down from one side of the fixture with a circular pull ring at the end. The light fixture itself looks somewhat aged or worn, with some visible marks or discoloration on its surface. The composition is simple and straightforward, shot from below looking directly up at the ceiling light. The image appears to be a black and white photograph depicting a circular object, likely a mirror or lens, set against a textured, abstract background. The background has a rough, granular texture, creating an interesting visual contrast with the smooth, reflective surface of the circular object in the center. The image has a somewhat surreal, dreamlike quality to it, with the isolated, centrally-placed circular form drawing the viewer's attention.Ceiling Fixture | Drawings | Modern and Contemporary Art | 1994 | Drawings | Catherine Murphy | American | Graphite on off-white wove paper |
Creating TF-IDF matrix and using it for cosine similarity comparisons against text query¶
I use scikit-learn’s TfidfVectorizer to transform each record into a vector that can be compared with the others based on the inclusion of words rare in the corpus. I did a grid search over possible hyperparameters. The difficulty with this technique is the absence of any natural ground truth. To create a scenario where I could hand the grid search a ground truth to evaluate against, I randomly selected 100 image/description pairs and manually created queries that I thought should have high cosine similarity to each description/image pair. This technique is flawed for a couple of reasons: first, my input is subjective, and second, I am only embedding a set of 100 text samples, while my actual model will embed 117,000 texts and may need different parameters for that reason alone. The grid search returned optimal parameters max_features = 10000 and ngram_range = (1, 1); upon inspection, I can see that altering the min_df parameter did not affect the scoring at all.
After seeing the results of the parameter grid search and trying different hyperparameters with my final recommender, I decided to remove English stop words, use an n-gram range of 1-3 (capturing single words as well as phrases up to three words long), and build a vocabulary of the top 15,000 terms. This choice is appropriate for art terms and artist-name queries, which are often two words or fewer. I also ignore terms that appear in fewer than two documents.
# first do a grid search for parameters on a small version of dataset
grid_df = pd.read_csv("ground_truth.csv")
# My ground truth sample queries
queries = [
'Artist in studio', 'Still life fruit', 'Ornamental pattern',
'17th century portrait', 'Historic Battle Scene', 'Medieval carving',
'Watercolor of boats', 'Boats in water', 'Military scene',
'Engraving of festival', '17th century Islamic manuscript',
'Simple Sculpture', 'Art with buildings', 'Cubist animal sculpture',
'Jael Slaying Sisera', 'Shards of ancient clay pots', 'Nude figure',
'Roman painting fragment', 'Art gallery photographic image',
'Saint Francis artwork'
]
# Ground truth mapping (query index : document index)
ground_truth = {
0: [0], 1: [8], 2: [11], 3: [12], 4: [13], 5: [15], 6: [17], 7: [69],
8: [22], 9: [25], 10: [30], 11: [35], 12: [59], 13: [52], 14: [61],
15: [62], 16: [65], 17: [77], 18: [3], 19: [86]
}
param_grid = {
'ngram_range': [(1, 1), (1, 2), (1, 3)],
'min_df': [1, 2, 5],
'max_features': [5000, 10000, 15000, 20000],
}
def evaluate_retrieval(tfidf, tfidf_matrix, queries, ground_truth):
"""Calculate average cosine similarity between queries and their ground truth indices."""
query_vecs = tfidf.transform(queries)
scores = []
for query_idx, doc_indices in ground_truth.items():
# Get ground truth vectors
gt_vecs = tfidf_matrix[doc_indices]
# Calculate similarity between query and its ground truth docs
sim = cosine_similarity(query_vecs[query_idx:query_idx+1], gt_vecs)
scores.append(np.mean(sim))
return np.mean(scores)
results = []
for params in tqdm(ParameterGrid(param_grid)):
tfidf = TfidfVectorizer(**params)
tfidf_matrix = tfidf.fit_transform(grid_df['body'])
score = evaluate_retrieval(tfidf, tfidf_matrix, queries, ground_truth)
results.append({
**params,
'score': score,
'vocab_size': len(tfidf.get_feature_names_out())
})
results_df = pd.DataFrame(results)
best_params = results_df.loc[results_df['score'].idxmax()]
print(best_params)
display(results_df.sort_values('score', ascending = False).iloc[:20])
100%|███████████████████████████████████████████| 36/36 [12:52<00:00, 21.46s/it]
max_features          10000
min_df                    1
ngram_range          (1, 1)
score              0.246763
vocab_size            10000
Name: 9, dtype: object
| max_features | min_df | ngram_range | score | vocab_size | |
|---|---|---|---|---|---|
| 9 | 10000 | 1 | (1, 1) | 0.246763 | 10000 |
| 15 | 10000 | 5 | (1, 1) | 0.246763 | 10000 |
| 12 | 10000 | 2 | (1, 1) | 0.246763 | 10000 |
| 18 | 15000 | 1 | (1, 1) | 0.244175 | 15000 |
| 24 | 15000 | 5 | (1, 1) | 0.244175 | 15000 |
| 21 | 15000 | 2 | (1, 1) | 0.244175 | 15000 |
| 33 | 20000 | 5 | (1, 1) | 0.242891 | 20000 |
| 30 | 20000 | 2 | (1, 1) | 0.242891 | 20000 |
| 27 | 20000 | 1 | (1, 1) | 0.242891 | 20000 |
| 0 | 5000 | 1 | (1, 1) | 0.233667 | 5000 |
| 3 | 5000 | 2 | (1, 1) | 0.233667 | 5000 |
| 6 | 5000 | 5 | (1, 1) | 0.233667 | 5000 |
| 7 | 5000 | 5 | (1, 2) | 0.194848 | 5000 |
| 1 | 5000 | 1 | (1, 2) | 0.194848 | 5000 |
| 4 | 5000 | 2 | (1, 2) | 0.194848 | 5000 |
| 5 | 5000 | 2 | (1, 3) | 0.184996 | 5000 |
| 8 | 5000 | 5 | (1, 3) | 0.184996 | 5000 |
| 2 | 5000 | 1 | (1, 3) | 0.184996 | 5000 |
| 10 | 10000 | 1 | (1, 2) | 0.182122 | 10000 |
| 13 | 10000 | 2 | (1, 2) | 0.182122 | 10000 |
heatmap_data = results_df.pivot_table(
index='ngram_range',
columns='max_features',
values='score',
aggfunc='mean'
)
sns.heatmap(heatmap_data, annot=True, fmt=".2f", cmap="YlGnBu")
plt.title('Score by ngram range and max features')
plt.show()
# Create TF-IDF matrix
tfidf = TfidfVectorizer(
stop_words='english', # do not process common english words
ngram_range=(1, 3), # capture single words and phrases
min_df=2, # ignore very rare terms
max_features= 15000 # only consider top 15000 features
)
tfidf_matrix = tfidf.fit_transform(df['body'])
# save the vectorizer and matrix to file to load in the recommender later
dump({"vectorizer": tfidf, "matrix": tfidf_matrix}, "tfidf_data_test.joblib")
# convert sparse matrix to dense
dense_matrix = tfidf_matrix.toarray().astype("float32")
# build the FAISS index
tfidf_index = faiss.IndexFlatIP(dense_matrix.shape[1])
tfidf_index.add(dense_matrix)
# save this vector store to file for later use in recommender
faiss.write_index(tfidf_index, "tfidf_index_test.faiss")
Check to see if there is any difference in similarity results between FAISS and brute force cosine similarity¶
Here, I try both a brute-force cosine search over the TF-IDF matrix and a Faiss flat index built from the same matrix for a fast cosine search. The returned nearest-neighbor vectors were identical, so I will stick with Faiss and, as my object set grows, may consider using a different Faiss search technique.
# Load and use
data = load("tfidf_data4.joblib")
vectorizer, tfidf_matrix = data["vectorizer"], data["matrix"]
# Query jumping cat
query = ["jumping cat"]
query_vec = vectorizer.transform(query)
# Find similar docs using FAISS
D, I = tfidf_index.search(query_vec.toarray().astype("float32"), k=8)
print("Top matches FAISS:", [i for i in I[0]])
# Find similar items using cosine similarity
cosine_sim = cosine_similarity(query_vec, tfidf_matrix)
top_indices = cosine_sim.argsort()[0][-8:][::-1]
print("Top matches Cosine_sim:", top_indices)
Top matches FAISS: [11417, 73615, 74015, 100367, 87813, 11296, 109618, 18960]
Top matches Cosine_sim: [ 11417 73615 74015 100367 87813 11296 109618 18960]
Examples of search with text strings 'jumping cat' and 'vibrant abstract landscape'¶
# Query vibrant abstract landscape
query = "vibrant abstract landscape"
query_vec = tfidf.transform([query])
# Find similar items using cosine similarity
cosine_sim = cosine_similarity(query_vec, tfidf_matrix)
top_indices = cosine_sim.argsort()[0][-8:][::-1]
print(top_indices)
print("Query 'Jumping Cat'")
for i in I[0]:
url = df.iloc[i]['primaryimageurl']
display(Image_show(url = url, width = 300))
print("Query 'vibrant abstract landscape'")
for i in top_indices:
url = df.iloc[i]['primaryimageurl']
display(Image_show(url = url, width = 300))
[116726 3904 84119 58605 115847 83748 53100 15952]
Query 'Jumping Cat'
Query 'vibrant abstract landscape'
query = "black and white family in a city"
query_vec = tfidf.transform([query])
# 3. Find similar items
cosine_sim = cosine_similarity(query_vec, tfidf_matrix)
top_indices = cosine_sim.argsort()[0][-5:][::-1] # Top 5 matches
for i in top_indices:
print(df.iloc[i]['body'])
url = df.iloc[i]['primaryimageurl']
display(Image_show(url = url, width = 300))
This is a historical print from the Salon of 1822, numbered 1945, titled "Une famille malheureuse" (An Unfortunate Family). It's a black and white lithograph showing a poignant scene of poverty and hardship. The composition depicts several figures in what appears to be a humble interior setting, with dramatic lighting and shadows. There are figures shown in poses suggesting distress and concern, with one standing and others seated. The artwork effectively conveys the emotional weight of the subject matter through its careful arrangement and detailed execution. The print is mounted on a green-tinted paper background, and the title and exhibition information are clearly visible below the image. The artist demonstrates skilled use of light and shadow to create a moving portrayal of human suffering and family bonds in difficult circumstances. The image depicts a family scene from 1822 based on the text at the top. It shows a mother seated with a young child on her lap, while two other children stand nearby. The setting appears to be inside a home or domestic space, with curtains visible in the background. The artwork is a black and white sketch or engraving with shading to convey depth and texture. The family is dressed in clothing typical of the early 19th century period. The bottom of the image has a title in French identifying it as "A Motherly Family". The image appears to be a print or engraving from the Salon de 1822 - 19 mars 1845. It depicts a domestic scene with several figures gathered around what seems to be a sickbed or medical treatment. The central figure appears to be a woman receiving care or attention from the others in the room. The image has a detailed, dramatic composition with strong chiaroscuro lighting effects. It seems to represent a moment of distress or illness in a family setting.An Unhappy Family | Prints | European and American Art | 18th-19th century | Prints | Pierre-Paul Prud'hon | French
The image appears to be a family portrait photograph from what looks like the 1950s or 1960s. It shows a group of around 10 individuals, both adults and children, seated and standing together. The individuals have varying hairstyles and clothing styles typical of that era, suggesting this is a multigenerational family portrait. The people in the image seem to be posing together in a formal, studio-style setting, conveying a sense of family unity and togetherness.Untitled (portrait of Mexican American family) | Photographs | Modern and Contemporary Art | 1960 | Photographs | Harry Annas | American
This is a historical illustration titled "La Famiglia dell'Autore" (The Author's Family) with "TUTTO FINISCE" (Everything Ends) written at the top. It's a domestic scene showing several people in classical dress gathered together, presumably in a home setting. The drawing style appears to be from the 18th or 19th century, with fine line work and cross-hatching techniques. The scene includes what appears to be a family group, with some members sitting while others stand. There are stacked books on a shelf above, and a skull visible among them, perhaps symbolizing mortality or scholarly pursuits. A small white dog stands at attention on the right side of the image. The drawing has a warm, intimate feeling despite being rendered in black and white. The composition is well-balanced and shows careful attention to detail in the clothing folds and interior setting. It appears to be capturing a moment of domestic life, possibly showing family members engaged in some kind of shared activity or discussion. The style and presentation suggest this might be part of a larger series or book illustration from the period. The image depicts a family sitting together in what appears to be their home, with the text "Tutto Finisce" (Italian for "Everything Ends") written at the top. The family consists of several women of varying ages sitting and working on some handiwork or crafts. There is a bookshelf filled with books in the background, and a small dog sitting near the family members. The drawing style has shading and detail to give it a somewhat realistic appearance, though the proportions and poses have a slightly exaggerated or caricatured quality. The overall scene suggests a domestic, everyday moment captured from 19th century life, likely depicting an upper or middle class Italian family based on their clothing and surroundings from that time period. The image appears to depict a family or household scene. There are several people, including what seem to be women and children, gathered in what looks like a domestic setting. Some of the individuals are seated, while others are standing or interacting with each other. There is also a bookshelf or shelving unit visible in the background, as well as a dog or similar pet in the foreground. The overall style of the image suggests it may be an illustration or engraving from an earlier time period.The Author's Family | Prints | European and American Art | 1810 | Prints | Bartolomeo Pinelli | Italian
The image appears to be a black and white sketch or drawing depicting a family scene. The central figure is a woman, possibly the mother, surrounded by several other individuals who appear to be her family members. The drawing shows a domestic interior setting with the family gathered together, some seated and others standing. The overall tone and style of the image suggest it is a work of art, likely from an earlier historical period. The black and white sketch or drawing depicts a group of five figures, who appear to be a mother and her children based on their varying sizes and ages. The mother figure is seated and looking off to the side, while the children are gathered around her, some sitting on her lap and others standing beside her. The sketch has a soft, hazy quality to it, giving an ethereal and timeless feeling to the family portrait composition. This appears to be a historical sketch or drawing showing a family scene, likely from the 18th century based on the style of dress and artistic technique. The artwork shows a group of six figures in a domestic setting - two adults seated on the left appear to be looking at what might be a book or document, while a woman in flowing robes sits in the center with two small children near her. Another figure stands at the right side of the composition. The drawing is done in what looks like ink or wash, with confident flowing lines and shading that gives it a gentle, intimate quality. The clothing and furnishings suggest this depicts a well-to-do household of the period. The informal, relaxed poses of the figures convey a sense of everyday family life rather than a formal portrait setting.The Family of the Earl of Granville | Drawings | European and American Art | 18th century | Drawings | Angelica Kauffmann | Swiss | Brown ink, brown wash, and white gouache over black chalk on off-white paper
This is a historical engraving titled "The Family of CORNARO a Noble Venetian" that appears to be from a collection in the Duke of Somerset's Cabinet in London. The artwork depicts a group of people in Renaissance-era Venetian clothing. The composition shows both adults and children arranged on what appears to be steps or a raised platform. The adults wear luxurious robes with fur trim, indicating their noble status, while the younger figures are dressed in simpler but still elegant attire. The scene includes decorative elements like candelabras and religious symbols in the background. The engraving technique creates fine details and subtle shading, typical of historical printmaking. The image includes text in both English and French at the bottom, suggesting it was meant for an international audience. The image depicts a family portrait from what appears to be the 16th or 17th century, based on the style of clothing. The family consists of three men, one woman, and six young boys. The men are wearing long, dark robes with fur trim, while the woman is wearing an ornate dress. The boys are dressed in similar dark clothing. They are all posed around a table or platform, with the adults standing and the children sitting or standing in front of them. The image has text below it identifying the family as the "Family of CORN/RO a Noble Venetian" in English and French. The overall scene gives an impression of a wealthy, noble family from Venice during that time period. The image depicts a large group portrait of an aristocratic family, presumably the Cornaro family, a noble Venetian family. The central figures appear to be an elderly patriarch and two other men, likely his sons or other relatives, all dressed in ornate, fur-trimmed robes and standing on a raised platform. They are surrounded by several other individuals, including women and children, who are also dressed in elaborate clothing. The scene appears to be set in a grand interior, with various decorative elements visible in the background, such as a chandelier and draperies. The overall tone of the image suggests a formal, ceremonial occasion or portrait.The Family of Cornaro, a Noble Venetian | Prints | European and American Art | 1732 | Prints | Bernard Baron
# here I drop the final unneeded columns and save the resulting df to a csv for use in the final recommender
dropped = ['Unnamed: 0.1','Unnamed: 0','id', 'division',
'colorcount', 'period', 'classification', 'technique', 'medium',
'colors', 'dated', 'dateend', 'department', 'people',
'url', 'publicationcount', 'totaluniquepageviews',
'totalpageviews', 'exhibitioncount',
'artist_culture']
df.drop(dropped, inplace=True, axis = 1)
df.columns
Index(['objectid', 'title', 'primaryimageurl', 'rank', 'imageid', 'body',
'image_embedding', 'date', 'artist_name', 'search_text'],
dtype='object')
#df.to_csv('final_toy_HAM4.csv')
df.to_csv('final_toy_HAMtest.csv')
Building the Search and Recommendation Engine¶
The final output of this project is the HAM_recommend class below. It uses Jupyter widgets as UI elements to take in the user's text query and likes and to request recommendations. The event interface is driven by text entry and button clicks. Output is displayed as image cards inside accordion widgets; each query gets its own accordion section.
As noted in the Use of Machine Generated Code section above, the DeepSeek model generated a portion of this UI (the image cards and the accordion sections); see that section for the prompts and validation details.
df = pd.read_csv('final_toy_HAM4.csv')
display(df.head(2))
class HAM_recommend():
""" Creates a Jupyter widget based UI that queries user for image search text,
allows them to like images, and provides image recommendations based on liked images.
usage:
rec = HAM_recommend()
rec.show()
"""
def __init__(self):
self.df = pd.read_csv('final_toy_HAMtest.csv')
self.query = ""
self.liked = []
self.all_displayed_indices = set() # Track ALL images ever displayed
self.current_displayed_indices = set()
self.image_index = faiss.read_index("image_index4.faiss")
self.text_index = faiss.read_index("tfidf_index_test.faiss")
self.tfidf = load("tfidf_data_test.joblib")["vectorizer"]
self.main_output = widgets.Output()
self.run = True
self.image_widgets = []
# container for all batches
self.batches_container = widgets.VBox()
display(self.batches_container)
# counter for batch IDs
self.batch_counter = 0
self.image_grid = widgets.GridBox(
layout=widgets.Layout(
grid_template_columns="repeat(3, 300px)",
grid_gap="10px",
width='100%'
)
)
self.current_batch_name = None
def download_image(self, url):
"""Download single image from given url"""
try:
response = requests.get(url, timeout=8)
if response.status_code == 200:
return BytesIO(response.content).read()
return None
except requests.RequestException:  # covers timeouts, connection errors, bad URLs
return None
def create_image_card(self, idx, score):
"""Create a widget card for an image with like button"""
image_widget = widgets.Image(
width=280,
height=280,
layout=widgets.Layout(visibility='hidden')
)
like_button = widgets.ToggleButton(
value=idx in self.liked,
description='❤️ Like',
layout=widgets.Layout(width='auto')
)
caption = widgets.HTML(
value=f"<b>Score:</b> {score:.2f}<br>Loading image..."
)
card = widgets.VBox([
image_widget,
caption,
like_button
], layout=widgets.Layout(
border='solid 1px gray',
padding='10px',
min_height='350px'
))
like_button.index = idx
like_button.observe(self.on_like_change, names='value')
return card, image_widget, caption
def on_like_change(self, change):
""" Allows user to like or unlike image with button click,
allows creating recommendations if number of likes is more than 0"""
idx = change['owner'].index
if change['new']:
if idx not in self.liked:
self.liked.append(idx)
else:
if idx in self.liked:
self.liked.remove(idx)
self.rec_button.disabled = len(self.liked) == 0 # Update button state
def search_for_text(self, button=None):
"""Search for and return similar objects when prompted with
query text. Vectorizes querry text with tfidf vectorizor then
performs similary search with the faiss text vector store"""
self.query = self.search_text.value
with self.main_output:
clear_output()
print(f"Searching for: {self.query}")
# Create query vector
query_vec = self.tfidf.transform([self.query])
# Get results
D, I = self.text_index.search(query_vec.toarray().astype("float32"), k=12)
# Show with header
self.show_images(I[0], D[0], batch_name=f"Search Results for: '{self.query}'")
def show_images(self, indices, scores, batch_name=None, requested_count=12, fetch_multiplier=3):
"""Displays images in a new collapsible batch after filtering duplicates."""
with self.main_output:
if not batch_name:
batch_name = f"Results {self.batch_counter + 1}"
# Get more candidates than needed to account for duplicates
candidate_indices = indices[:requested_count * fetch_multiplier]
candidate_scores = scores[:requested_count * fetch_multiplier]
# Filter out already displayed images
new_indices = []
new_scores = []
for idx, score in zip(candidate_indices, candidate_scores):
if idx not in self.all_displayed_indices:
new_indices.append(idx)
new_scores.append(score)
self.all_displayed_indices.add(idx)
if len(new_indices) >= requested_count:
break # Stop when we have enough
if not new_indices:
print(f"No new images found for '{batch_name}'")
return
# Create batch UI elements
batch_id = f"batch_{self.batch_counter}"
self.batch_counter += 1
image_grid = widgets.GridBox(
layout=widgets.Layout(
grid_template_columns="repeat(3, 300px)",
grid_gap="10px",
width='100%'
)
)
# Create and load image cards
cards = []
for idx, score in zip(new_indices, new_scores):
card, img_widget, caption = self.create_image_card(idx, score)
cards.append(card)
url = self.df.iloc[idx]['primaryimageurl']
image_data = self.download_image(url)
if image_data:
img_widget.value = image_data
img_widget.layout.visibility = 'visible'
title = str(self.df.iloc[idx].title)
name = str(self.df.iloc[idx].artist_name)
date = str(self.df.iloc[idx].date)
info = title + ', ' + name + ', ' + date
caption.value = f"{info}<br><b>Score:</b> {score:.2f}"
image_grid.children = cards
# Add to history
accordion = widgets.Accordion(children=[image_grid])
accordion.set_title(0, batch_name)
accordion.selected_index = None
self.batches_container.children += (accordion,)
def search_by_likes(self, button=None):
"""Using list of liked images, extract thier embeddings, average them
and query the faiss image vector store for similar images. Filter already
seen images and display. """
if not self.liked:
with self.main_output:
print("Please like some images first!")
return
try:
# embeddings round-trip through the CSV as strings, so parse them back here
liked_embeddings = [np.array(eval(self.df.iloc[idx]['image_embedding']))
for idx in self.liked]
avg_embedding = np.mean(liked_embeddings, axis=0).reshape(1, -1).astype('float32')
D, I = self.image_index.search(avg_embedding, k=36) # Get 3x more than needed
self.show_images(I[0], D[0], batch_name="Recommended Based On Your Likes")
except Exception as e:
with self.main_output:
print(f"Error getting recommendations: {str(e)}")
def clear_likes(self, button):
"""Clear all liked images and refresh the UI"""
self.liked = []
with self.main_output:
clear_output()
print("Cleared all likes")
# Update the recommendation button state
self.rec_button.disabled = True
# Update all like buttons in existing cards
for accordion in self.batches_container.children:
for grid in accordion.children:
for card in grid.children:
like_button = card.children[2] # Like button is the 3rd child
like_button.value = False
def text_query_HAM(self):
"""Create search UI box with text query button, get
recomendations button, and clear all likes button."""
search_box = widgets.HBox([
widgets.Text(description="Search:", layout=widgets.Layout(width='400px')),
widgets.Button(description="Find Images")
])
self.search_text = search_box.children[0]
search_button = search_box.children[1]
# Create recommendation button
self.rec_button = widgets.Button(
description="Get Recommendations Based on Likes",
disabled=len(self.liked) == 0,
layout=widgets.Layout(width='auto', margin='10px')
)
# Create clear likes button
clear_likes_button = widgets.Button(
description="Clear All Likes",
button_style='danger',
layout=widgets.Layout(width='auto', margin='10px')
)
# Connect buttons to their functions
search_button.on_click(self._on_search_click)
self.rec_button.on_click(lambda b: self.search_by_likes())
clear_likes_button.on_click(self.clear_likes)
# Create button container
button_row = widgets.HBox([
self.rec_button,
clear_likes_button
], layout=widgets.Layout(
justify_content='center',
margin='10px 0'
))
# display all UI elements
display(search_box)
display(button_row)
display(self.main_output)
def _on_search_click(self, button):
"""Handle search with basic spelling correction"""
self.query = self.search_text.value
self.search_for_text()
def show(self):
"""Main display method"""
self.text_query_HAM()
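To make the recommendation step easier to follow in isolation, here is a minimal standalone sketch of what search_by_likes does: average the embeddings of all liked images into a single 'taste' vector and ask the FAISS image index for its nearest neighbours. The liked row indices here are hypothetical, and ast.literal_eval is used as a safer stand-in for eval when parsing the stringified embeddings.
import ast
import faiss
import numpy as np
import pandas as pd

df = pd.read_csv('final_toy_HAMtest.csv')
image_index = faiss.read_index('image_index4.faiss')

liked = [3, 17, 42]  # hypothetical liked row indices

# parse the stringified embeddings back into numeric arrays
liked_embeddings = [np.array(ast.literal_eval(df.iloc[i]['image_embedding']))
                    for i in liked]

# collapse all likes into one 'taste' vector by element-wise averaging
avg = np.mean(liked_embeddings, axis=0).reshape(1, -1).astype('float32')

# distances and row indices of the 36 nearest images in embedding space;
# the recommender filters already-displayed rows before showing the rest
D, I = image_index.search(avg, k=36)
print(I[0][:5], D[0][:5])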
Sample Output¶
I spent many hours getting the recommender UI to work perfectly, only to realize that I could not easily export the results of my queries because they are live widgets. I instead took screenshots of the results, which are posted below. You can see that liked items have like buttons rendered in a darker shade of gray.
Search: circus cat¶
Here I enter a query that returns two different types of imagery, 'circus' and 'cat.' By liking only circus imagery, I am returned images that focus solely on the circus aspect. For the next round of recommendations, I unlike the circus-themed images and like only the images with Asian themes, and the recommender returns images based on that new selection.
The results from the HAM website¶
with open("Screenshot 2025-05-06 at 1.15.28 PM.png", "rb") as f:
display(Image_show(f.read(), width=1000))
My results¶
# Create and display the recommender
rec = HAM_recommend()
rec.show()
with open("Screenshot 2025-05-07 at 9.32.28 PM.png", "rb") as f:
display(Image_show(f.read(), width=1000))
with open("Screenshot 2025-05-07 at 9.32.41 PM.png", "rb") as f:
display(Image_show(f.read(), width=1000))
with open("Screenshot 2025-05-07 at 9.33.00 PM.png", "rb") as f:
display(Image_show(f.read(), width=1000))
with open("Screenshot 2025-05-06 at 1.16.20 PM.png", "rb") as f:
display(Image_show(f.read(), width=1000))
My results: second round of recommendations¶
# Create and display the recommender
rec = HAM_recommend()
rec.show()
with open("Screenshot 2025-05-07 at 9.39.23 PM.png", "rb") as f:
display(Image_show(f.read(), width=1000))
with open("Screenshot 2025-05-07 at 9.39.34 PM.png", "rb") as f:
display(Image_show(f.read(), width=1000))
Search: Greek Art¶
Here I liked the images most similar to classical Greek statues and was returned exclusively statues.
# Create and display the recommender
rec = HAM_recommend()
rec.show()
with open("Screenshot 2025-05-07 at 10.03.12 PM.png", "rb") as f:
display(Image_show(f.read(), width=1000))
with open("Screenshot 2025-05-07 at 10.03.34 PM.png", "rb") as f:
display(Image_show(f.read(), width=1000))
Search: Degas Dancer¶
Here I enter the query term 'Degas Dancer.' I then like the most colorful Degas dancer image and am recommended other brightly colored paintings and drawings of dancers.
# Create and display the recommender
rec = HAM_recommend()
rec.show()
with open("Screenshot 2025-05-07 at 10.03.48 PM.png", "rb") as f:
display(Image_show(f.read(), width=1000))
with open("Screenshot 2025-05-07 at 10.04.38 PM.png", "rb") as f:
display(Image_show(f.read(), width=1000))
References¶
CLIP-Art: Contrastive Pre-Training for Fine-Grained Art Classification, Marcos V. Conde, Kerem Turgutlu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2021, pp. 3956-3960 [https://openaccess.thecvf.com/content/CVPR2021W/CVFAD/papers/Conde_CLIP-Art_Contrastive_Pre-Training_for_Fine-Grained_Art_Classification_CVPRW_2021_paper.pdf]
Faiss: Billion-scale similarity search with GPUs, Jeff Johnson, Matthijs Douze, Hervé Jégou; arXiv, 2017 [https://arxiv.org/abs/1702.08734] [https://github.com/facebookresearch/faiss/wiki/]
CLIP [https://openai.com/index/clip/]
About HAM AI Annotations [https://ai.harvardartmuseums.org/statistics]
Harvard Art Museum API docs [https://github.com/harvardartmuseums/api-docs/blob/master/sections/object.md]
Harvard Art Museum Collections Homepage [https://harvardartmuseums.org/collections]
DeepSeek [https://chat.deepseek.com]
AI Description Annotations Made by:
- Amazon Nova (via AWS Bedrock) : Description generation via prompting
- Anthropic Claude (via AWS Bedrock) : Description generation via prompting
- Clarifai : Concepts
- Google Gemini : Description generation via prompting
- Meta Llama (via AWS Bedrock) : Description generation via prompting
- Microsoft Cognitive Services : Descriptions
- Mistral Pixtral (via AWS Bedrock) : Description generation via prompting
- OpenAI GPT (via Azure) : Description generation via prompting